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0 .  Introduction 

The  problem  of  spectral  analysis  of  time  series  is  clearly 
of  great  interest  to  the  many  applied  scientists  who  use 
spectral  analysis  in  their  scientific  research.  It  should  be 
of  great  interest  to  statisticians  because  it  embodies 
prototypes  of  two  of  the  great  problems  of  modern  statistics: 
functional  inference  and  modeling.  A  problem  of  statistical 
inference  usually  assumes  three  ingredients:  a  sample  of 
observations,  a  parameter  which  indexes  the  family  of  possible 
joint  probability  densities  of  the  sample,  and  a  formula  for 
the  probability  density  of  the  sample 

f( sample | parameter) . 


Classical  statistical  inference  assumes  the  parameter  is  a 
finite  dimensional  parameter  6  =  (9^ , . . . , 0^) .  Functional 
inference  assumes  the  parameter  is  a  function,  such  as  f(w), 

0<U><1  . 

The  parameter  estimation  problem  seeks  to  form  optimal 

A 

estimators  (denoted  6)  of  the  parameter.  A  typical  model 
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identification  problem  seeks  to  find  the  smallest  number  of 
significantly  non-zero  components  0^  of  the  parameter  0. 

Estimation  of  a  function  often  has  similar  features  to 
model  identification,  since  a  function  can  be  parametrized 
exactly  by  a  countable  infinity  of  parameters.  However  in 
practice  one  can  only  efficiently  estimate  a  finite  number  of 
parameters.  Therefore  to  estimate  a  function  one  must  use 
the  smallest  finite  number  of  parameters  which  provide  an 
adequate  approximation  of  the  function. 

The  goals  of  functional  inference  and  model  identification 
are  in  my  view  best  pursued  simultaneously.  One  seeks  methods 
of  statistical  inference  which  are  finite-parametric  and 
non- (or  infinite-)  parametric.  One  achieves  this  goal  by 
using  finite  parameter  models  (which  theory  indicates  might 
be  exact  methods)  in  ways  that  enable  them  to  be  interpreted 
also  as  approximating  models. 

Autoregressive  spectral  estimation  is  one  of  the  new 
techniques  for  spectral  analysis  developed  in  the  last  two 
decades.  Its  theory  and  applications  are  extremely  extensive. 
This  article  aims  to  provide  an  overview  (rather  than  a  detailed 
account)  of  the  main  ideas.  A  comprehensive  bibliography  guides 
the  reader  to  articles  in  which  case  studies  and  comparisons  of 
autoregressive  spectral  estimators  are  described. 

The  spectral  density  is  defined  in  section  1.  Infinite 
order  AR  and  MA  representations  of  a  stationary  time  series 
are  introduced  in  section  2.  Entropy  as  a  motivation  for 


autoregressive  schemes  is  discussed  in  section  3.  Alternate 
parametrizations  of  an  autoregressive  scheme  are  outlined  in 
section  4.  Algorithms  for  computing  the  coefficients  of 
autoregressive  spectral  densities  are  stated  in  section  5. 
Criteria  for  determining  the  orders  of  autoregressive  schemes 
are  mentioned  in  section  6.  Suggestions  for  empirical 
spectral  analysis  are  outlined  in  section  7.  The  final 
section  provides  a  guide  to  the  literature  of  autoregressive 
spectral  estimation  by  listing  the  articles  which  correspond 
to  some  important  developments. 

1 .  Correlations  and  Spectral  Functions  of  a  Stationary  Time 
Series 

The  theory  of  time  series  discusses  separately  discrete 
parameter  time  series  {Y(t) ,  t=0,  +1,  +2,...}  and  continuous 
parameter  time  series  {Y(t),  -oo<t<«>}.  Only  the  former  case  is 
discussed  in  this  article.  Discrete  parameter  series  usually 
arise  by  observing  a  continuous  parameter  time  series  at  equi- 
spaced  times.  The  frequency  variable  w  of  a  pure  harmonic 

(1)  Y(t)  «  A  cos  2irwt  +  B  sin  2iru)t 

observed  at  t  *  0,  +1,...  can  be  assumed  to  vary  in  either 
-0.5  <  u  <  0.5  or  0  <  iii  <  1,  The  first  interval  is  usually 
adopted,  but  the  second  interval  will  be  adopted  in  this 
article  because  it  is  more  convenient  for  developing 
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isomorphisms  between  spectral  analysis  and  non-parametric  data 
modeling  using  quantile  and  density-quantile  functions  [see 
Parzen  (1979)]. 

The  definitions  and  notation  we  adopt  for  the  functions 
used  to  describe  a  zero  mean  stationary  Gaussian  discrete 
parameter  time  series  Y(t),  t=0,  +1,...  are  as  follows. 

A  "time  domain"  specification  of  the  probability  law  of 
Y(-)  is  provided  by  the  covariance  function 

(2)  R(v)  =  E[ Y (t) Y (t+v) ] ,  v=0 ,  +1,  ±2 _ ; 

or  by  the  variance  R(0)  and  the  correlation  function 

(3)  P(v)  =  =  Corr  [Y(t>,  Y(t+v)]. 

(  To^  define  ‘  spectral  (frequency)  domain  specification  of 
the  probability  law  of  Y(-)  we  first  assume  summability  of 
R(-)  and  p(-)-  The  Fourier  transforms  of  R(v)  and  p(v)  are 
called  the  power  spectrum  S(w)  and  spectral  density  function  f(u>) 
respectively,  and  are  defined  by 

(4)  S(u>)  =  l  e“27livu,R(v)  ,  0<o)<l; 

°o 

(5)  f(u>)  =  l  e'2lliVus  p  ( v )  ,  0<u><1. 

Vs-oo 

The  spectral  distribution  function  is  defined  by 

(6)  F (to)  =  /w  f(w')  do)',  0<w<l 

o 


For  data  analysis  one  actually  computes  a  modified  spectral 
distribution  function  denoted  F+(w)  and  defined  for  0<uj<0.5: 
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(7)  F+(ta>)  -  2F(u),  0<u<0.5. 

A  correlation  function  p(v)  has  a  mathematical  property  of 
being  a  positive  definite  function  which,  without  assuming 
summability,  guarantees  the  existence  of:  (1)  a  spectral 
distribution  function  F(u)),  and  (2)  the  spectral  representati 
of  the  correlation  functionp(v)  given  by 

(8)  P(v)  =  J1  e27rivu)  dF(w) 

o 

=  cos  2ttvuj  dF+(w) , 

o 

The  notion  of  an  ergodic  time  series  is  not 
definition  in  this  article  but  its  intuitive  meaning  is  important 
for  us.  We  call  a  time  series  ergodic  when  the  parameters  of 
its  probability  law  possess  consistent  estimators  (and  thus  can 
be  determined  with  probability  one,  given  a  sample  of  infinite 
length) .  An  example  of  a  non-ergodic  stationary  Gaussian  zero 
mean  time  series  is 

Y(t)  =  A  cos  2TTujt  +  B  sin  2Tru>t 

where  A  and  B  are  independent  N(0,o2)  random  variables.  One 
can  infer  the  values  of  A  and  B  in  the  sample  observed,  but 
one  cannot  infer  the  value  of  a2  .  This  time  series  has 
correlation  function 
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(9)  p(v)  =  cos  2itojv,  v~0,  +  1,  +  2,..., 

and  does  not  possess  a  spectral  density. 

Parzen  (1982)  proposes  that  it  is  useful  in  practice  to 
distinguish  qualitatively  between  three  types  of  time  series, 
which  we  call 

no  memory:  white  noise, 

short  memory:  stationary  and  ergodic , 

long  memory:  non- stationary  or  non-ergodic. 

A  no -memory  or  white  noise  time  series  is  a  stationary 
Gaussian,  time  series  satisfying  either  of  the  equivalent 
conditions : 

p(v)  =  0  for  v>0 ; 

(10) 

f(w)  =  1,0<u<1. 

A  short  memory  time  series  is  a  stationary  time  series 
possessing  a  summable  correlation  function  p(v)  and  a  spectral 
density  f(w)  which  is  bounded  above  and  below  in  the  sense 
that  the  dynamic  range  of  f(w) 

max  min 

(11)  DR(f )  =  0<w<l  f(u)  t  0<u)<l  f(w) 

satisfies  1<  DR(f)< 

A  long  memory  time  series  is  one  which  is  neither  no 
memory  nor  short  memory;  alternatively,  a  long  memory  time 
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series  is  one  which  is  non-stationary  or  non-ergodic.  It 
usually  has  components  representing  cycles  or  trends.  An 
example  of  a  long  memory  time  series  is  (1)  where  A  and  B 
are  independent  N(0,a2)  random  variables. 

For  a  short  memory  time  series  one  can  define  the  inverse- 
correlation  function 

(12)  pi (v)  =  J1  e2TTlvti)  f-1(u))du>  r  f1  f~ 1  (w)  du> 

°  o 

and  the  ceps tral- correlation  function 

(13)  y (v)  =  ]_0g  f(u)  dw 

o 

It  should  be  noted  that  the  inverse-correlation  function  is 
positive  definite.  However  the  cepstral-correlation  function 
is  not.  These  new  types  of  correlation  functions  are 
introduced  because  they  may  provide  more  parsimonious 
parameterizations  in  the  sense  that  they  decay  to  0  faster 
than  does  the  correlation  function.  Statistical  inference 
from  a  sample  of  the  probability  law  of  a  time  series  often 
achieves  greatest  statistical  efficiency  by  using  the  most 
parsimonious  parametrizations .  Thus  to  form  estimators  f(u>) 
of  the  spectral  density  f(w)  from  a  raw  estimator  f(a>), 
greater  precision  may  be  attained  by  first  forming  estimators 
{f  *"(<*>)  K  and  (log  f(w)}~  of  the  inverse  or  logarithm  of  the 
spectral  density.  Autoregressive  spectral  estimation  may  be 
regarded  as  an  approach  to  estimating  f(u>)  by  first 
estimating  f"^(ui). 


2 .  Filter  Models  of  a  Short  Memory  Stationary  Time  Series 

A  short  memory  zero  mean  Gaussian  stationary  time  series 
Y(t)  has  a  property  of  fundamental  importance:  it  can  be 
linearly  transformed  to  a  white  noise  time  series,  denoted 
Yv(t)  or  e ( t ) ,  by  a  filter  that  is  invertible.  The  definition 
of  YV(t)  is  provided  by  the  theory  of  minimum  mean  square  error 
prediction . 

The  definitions  and  notation  of  prediction  theory  that 
we  adopt  are  as  follows: 

(1)  .  Yy’m  (t)  =  E[Y(t) | Y(t-l) , . . . , Y(t-m) ] 

denotes  the  memory  m  one-step  ahead  predictor  while 

(2)  Yv,tn  (t)  -  Y (t)  -  Yy,m(t) 
is  the  prediction  error,  and 


(3)  «  E  [  |  Yv  ’  m(t)  |  2  ]  ♦  E [  [  Y(t)  |  2  ] 

is  the  mean  square  prediction  error  measured  in  units  of  the 
variance  R(0)  of  Y(t).  Corresponding  infinite  memory  notation 
is 

(4)  Yy(t)  *  E[Y(t) | Y(t-l) ,  Y(t-2) , . . . ] 

(5)  Yv(t)  =  Y(t)  -  Yy(t) 

(6)  a2  -  E[ j  Yv (t) j  2 ]  v  E [ j  Y (t) | 2 ] . 


Explicit  formulas  for  the  foregoing  predictors, 
prediction  errors,  and  mean  square  prediction  errors  can  be 
obtained  from  the  correlation  function  p(v).  The  autoregressive 
coefficients  am(l) , . . . , am(m)  of  order  m  are  defined  by 

(7)  -Yp’m(t)  =  am(l)  Y(t-l)  +...+am(m)  Y(t-m) , 

(8)  YV'm  (t)  =  Y (t)  +  am(l)  Y(t-l)+...+om(m)  Y(t-m) . 

A  predictor  is  characterized  by  the  condition  that  the 
prediction  error  is  orthogonal  (normal)  to  the  predictor 
variables : 

(9)  E[Yv>m(t)Y(t-k)]  =  0,  k-1 . m 

By  substituting  (8)  into  (9)  one  obtains  the  famous  Yule-Walker 
equations,  defining  am(0)  =  1. 

(10)  I  a  (j)  p(j-k)  =  0,  k-1 . m 

j-o  m 

One  obtains  o*  by 
m  J 

(11)  o*  -  E[YV'm(t>  Y(t))t  E[  |Y(t)  j  z]  =  |0%(j)  p(j) 

For  a  short  memory  time  series,  these  equations  also  hold  with 
m=°° . 

The  time  series  of  infinite  memory  prediction  errors 
Yv(t)  is  always  a  white  noise  series  called  the  innovations . 
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It  provides  a  transformation  of  the  time  series  Y(t)  to  a 
white  noise  time  series  Yv(t)  which  we  write 

(12)  Y(t)  +  0^(1)  Y(t-l)  +.  .  .  -  Yv(t), 

and  call  an  AR(°°)  or  infinite  order  autoregressive  scheme  for 
Y(t).  An  MA(°°)  or  infinite  order  moving  average  scheme 
representation  is 

(13)  Y(t)  =  Yv(t)  +  6^(1)  Yv(t-1)  +.  .  . 

whose  coefficients  ^(k)  can  be  determined  recursively  from 
0*0)  by  6^  (0)  =  1,  and  for  k>0 

k 

(14)  B0o(k)  +  .(j)  8^  (k-j)  =  0 

The  AR(°°)  and  MA(°°)  representations  have  important 
implications  for  spectral  analysis  since  they  provide  formulas 
for  the  spectral  density  function  f(w)  alternative  to  the 
formula  that  f(w)  is  the  Fourier  transform  of  p (v) .  One 
can  show  that 

(15)  f(»)  -  oJh„(e2,,i")|2  , 

(16)  f'hu)  -  Og^e2"1")!2  , 

defining 

-  ?  800(j)  zj,  gjz)  =  l  “a, (k)  zk- 

j=0  k=0 


(17) 


^(z) 
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These  infinite  series  converge  in  general  in  mean  square  on 
the  unit  interval.  In  order  to  guarantee  pointwise  convergence 
at  each  ui  in  0<w<l  one  must  make  an  additional  smoothness 
condition  such  as  a  Lipshitz  condition  on  f(u)),  which  is  implied 
in  turn  by  the  condition 

oo 

(18)  l  |  V I  I  p  (v>  |  <00 

v=  -  00 

In  spectral  analysis  we  usually  assume  at  least  the  existence 
of  a  continuous  second  derivative,  which  is  implied  by  the 
condition 

(19)  l  |  v |  2  |  p(v)  |  <  «. 

V=-oo 

The  notion  of  a  time  series  Y(-)  being  an  autoregressive 
scheme  of  order  p,  denoted  AR(p) ,  can  be  defined  in  terms  of 
prediction  theory  as  follows:  Y(*)  is  an  AR(p)  if  and  only  if 
the  memory  p  prediction  errors  Yv’^(*)  is  a  white  noise  time 
series  and  ctp(p)  f  0.  The  spectral  density  of  Yv’^(*)  can  be 
expressed  in  terms  of  the  autoregressive  transfer  function  of 
order  p 

(20)  g  (z)  =  a  (0)  +  a  (1)  z+.  .  .+a  (p)  zp 

p  P  P  P 

by 
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<21)  f,v,p  <“>  ■  5J  Igpte2”1")!1  fU) 

If  the  time  series  Y(*)  is  in  fact  AR(p) ,  then  its  spectral 
density  equals  the  function 

(22)  fp<»)  -  o*  I gp(e2lr*w)  | ~ 1 

which  we  call,  in  general,  the  approximating  autoregressive 
spectral  density  of  order  p.  A  time  series  Y(*)  can  be 
regarded  as  approximated  by  an  AR(p)  if 

(23)  V>  ■ 

r' 

can  be  regarded  as  not  "significantly"  different  from  the 
constant  1.  In  this  way  a  test  of  the  hypothesis  that  a  time 
series  Y(*)  is  AR(p)  can  be  converted  to  a  test  of  the  hypothesis 
that  the  prediction  error  time  series  is  white  noise. 

The  sequence  of  approximating  autoregressive  spectral 
densities  fm(w)  ,  m=*l,2,...  may  be  shown  to  converge  as  m  tends 

to  »  to  f (w)  at  each  w  in  0<uk1  under  suitable  conditions  (see 
especially  Nevai  (1979)).  Sufficient  conditions  are  that  f(u>) 
has  finite  dynamic  range  (and  therefore  is  bounded  above  and 
below)  and  has  a  continuous  derivative.  When  an  estimator, 

A 

denoted  fm(w),of  fm(uj)  is  used  as  an  estimator  of  f(u)),  one  has 
to  take  into  account  two  kinds  of  errors,  called  respectively 
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bias  and  variance.  Bias  is  a  measure  of  the  deterministic 
difference  between  fm(u))  and  f(w);  while  variance  is  a  measure 

/v 

of  the  stochastic  distance  between  f  (oj)  and  f  (w) .  As  m 

m  m 

increases  bias  decreases  while  variance  increases.  This  is  an 
example  of  the  fundamental  problem  of  empirical  spectral 
naalysis  which  is  how  to  achieve  an  optimal  balance  between 
bias  and  variance.  When  one  uses  autoregressive  spectral 
estimation,  this  problem  reduces  to  a  question  of  determining 
the  order  m  of  the  approximating  autoregressive  scheme,  which 
is  discussed  in  section  6. 

It  should  be  noted  that  basic  references  for  the 
mathematical  properties  (especially  convergence)  of  autoregressive 
transfer  functions  g^z)  are  Geronimus  (1960)  and  Grenander  and 
Szego  (1958)  . 
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3.  Entropy  Interpretation  of  Autoregressive  Spectral  Densities 
The  use  of  autoregressive  spectral  densities  as  exact 
models,  and  as  approximating  models,  for  true  spectral  densities 
is  often  questioned  by  sceptical  statisticians  on  the  ground 
that  their  use  in  general  is  ad  hoc  and  without  theoretidal 
justification.  A  possible  answer  to  this  criticism  is  provided 
by  entropy  concepts.  This  section  reviews  these  concents  in  order 
to  state  their  application  to  spectral  estimation. 

The  notion  of  entropy  in  statistics  is  usually  first  defined 
for  a  discrete  distribution  with  probability  mass  function  p(x) . 
The  entropy  of  this  distribution,  denoted  H(p) ,  is  defined  by 

(1)  H(p)  =  -  l  p (x)  log  p (x) 

x 

For  the  distribution  of  a  continuous  real  valued  random  variable 
X,  with  probability  density  function  f(x),  entropy  is  defined 
(analogously  or  formally)  by 

(2)  H(f)  =  -/°°  f(x)  log  f(x)  dx 

—  00 

A  concept  closely  related  to  entropy  is  information 
divergence  I(f;g)  between  two  probability  densities  f(x)  and 
g(x) ,  defined  by 

(3)  I(f;g)  =  /"(- log  f^y>  f(x)  dx 


One  should  note  that  I(f;g)  equals  minus  the  generalized 
entropy  H(f jg)  defined  by 

(4)  H(f | g)  -  /"<-§$-  log  g(x)  dx 

Another  fundamental  concept  is  cross-entropy  defined  by 

(5)  H(f ; g)  =  /“  {-log  g(x)}  f(x)  dx. 

-  OO 

Note  that  H(f)  =  H(f;f). 

Information  divergence  is  expressed  in  terms  of  cross¬ 
entropy  and  entropy  by 

(6)  Kf;g)  =  H(f ; g)  -  H(f) 

Important  Information  Inequality: 

(7)  I(f;g)  >  0 

with  equality  if  and  only  if  f  =  g;  consequently 

(8)  H(f )  <H(f;g) 

Proof:  Apply  Jensen’s  inequality  which  states  for  an 
arbitrary  function  h(x) 


(9) 


/°°  (log  h(x) }  f  (x)  dx  £  log  r  h(x)  f(x)  dx 
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with  equality  if  and  only  if  h(x)  =  1  for  almost  all  x  with 
respect  to  the  probability  density  f(x) . 

Some  applications  of  entropy  in  probability  and  statistical 
modeling  are  as  follows. 

The  method  of  maximum  likelihood  parameter  estimation  can 
be  described  abstractly  as  follows.  One  introduces  a  parametric 
family  of  probability  densities  f0(x),  indexed  by  a  vector 
parameter  0=  (0^, . . . , 0^) .  Suppose  there  is  a  true  parameter 
value  1)  in  the  sense  that  the  true  probability  density 
f(x)  =  f^(x)  .  Then  "0  satisfies 

(10)  H(f )  =  H(f;f?)  =  mjn  H(f;f0). 

To  estimate  ¥  from  data,  one  forms  an  estimator  H(f;f0)  of 

A 

H(f;fQ)  and  defines  an  estimator  0  of  0  by 
y 

(11)  H(f;  fg)  =  H  (f ;fQ) . 

The  estimator  H(f;  fQ)  could  be  of  the  form 

y 

(12)  H(f ;  fQ)  ~  H(f ;  fQ) 

for  a  suitable  raw  estimator  f(x)  of  f(x). 

The  parametric  families  of  probability  densities  fQ(x) 
are  often  derived  axiomatically  using  a  maximum  entropy  principle  * 
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1 


Theorem:  Fix  k  functions  Tj (x) ,  j-l,2,...,k,  and  k  real 

numbers  T2*-'-,Tk  suc^  that  there  exists  probability 
densities  f(x)  satisfying 

(13)  /“  T. (x)  f(x)  dx  =  t . ,  j«l,...,k. 

—  OO  **  " 

The  density  with  maximum  entropy  H(f)  among  these  densities 
is  of  the  form 

k 

(14)  log  fQ(x)  =  ,Ii0jTj<x)  “  . 6k) 

where 

k 

(15)  V(01,...,e.)  -  log  /“dx  exp  l  e.T.(x)  , 

^  -00  j  =  l  J  J 

and  6j_ . 0k  are  chosen  to  satisfy 

(16)  /“  T . (x)  f  _ (x)  dx  =  x.  ,  J«l,...,k. 

—  00  J  °  J 


Proof :  The  theorem  may  be  proved  by  calculus  of  variations 
arguments.  A  quick  proof  is  to  verify  that  for  any  f(x) 
satisfying  the  moment  constraints  (13) 

k 


(17) 


H(f ; 


f9> 


He,, - ek)  -  l  e.T, 

1  K  j-l  J  J 


H(fe). 


and  therefore 
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(18)  H(f)  <  H(f;f0)  -  H(fQ) . 

Thus  the  maximum  entropy  is  achieved  by  f0(x). 

Natural  Exponential  models.  A  parametric  family  of  probability 
densities  fQ(x)  is  said  to  obey  a  natural  exponential  model 
when  it  is  of  the  form  (14) .  Thus  natural  exponential  models 
are  maximum  entropy  probability  densities. 

To  extend  entropy  concepts  to  short  memory  stationary  zero 
mean  Gaussian  time  series,  define  the  information  divergence 
for  a  sample  (Y(t) ,  t=l,2,...,T)  as  a  function  of  the  true 
probability  density  f  of  the  sample,  and  a  model  g  for  f.  We 
define 


Kf;  8)  =  iT(f;g) 


iT(f 


;g)  ~  “T  Ef  ^ 


g(Y(l) . Y  (t)  ) 


It  should  be  noted  that  we  are  using  the  notation  f  and  g  with 
a  variety  of  meanings.  For  a  Gaussian  zero  mean  stationary  time 
series,  the  probability  density  of  the  sample  is  specified  by 
the  spectral  densities  f(ui)  of  the  true  distribution  and  g(u>) 
of  the  model.  The  arguments  of  the  information  divergence 
I ( f ; g)  indicate  spectral  densities  in  the  following  discussion. 
Pinsker  (1963)  derives  the  following  very  important  formula: 


(21) 


i(£:g)  -  {M  - log  -11  d“ 
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Since  u  -  log  u  -  1  _>  0  for  all  u,  I  has  two  of  the  properties 
of  a  distance:  I(f;g)  0,  I(f;f)  *  0.  However  I  does  not 
satisfy  the  triangle-inequality. 

We  define  the  cross-entropy  of  spectral  density  functions 
f(u>)  and  g  (to)  by 

(22)  H(f;g)  =  \  fl  {log  g(w)  +  1^-}  do) 

The  entropy  of  f  is 

(23)  H(f)  =  H(f ;  f )  =  \  I1  {log  f(u>)  +  1}  du 

o 

Information  divergence  can  be  expressed 
(2A)  I (f ; g)  =  H(f;g)  -  H(f ) 

Hence  • 

(25)  H(f)  <  H(f ;g) 

An  approximating  autoregressive  spectral  density  of  order 
m,  denoted  fm(u))  ,  to  a  spectral  density  f(w)  is  defined  by 

(26)  H(f:fm)  -  “in  H(f;fm) 

m 

where  the  minimization  is  over  all  f  of  the  form 
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(27)  *„<“>  -  <&  ign<e2l,i“>r1 . 

(28)  ^(z)  -  1  +  Omd)  z+*  •  zm 


One  may  verify  that 

(29)  H(f;fm)  “  \  {1°8  am  +  ^  Z1  gjfe2*1")  |2  f  Cui)  doj} 

The  coefficients  a2  ,  a  (D . a(m)  of  the  minimum  cross- 

m  m  m 

entropy  approximating  autoregressive  spectral  density  satisfy 


(3°)  =  J1  |gm(e27rlw)  1 2  f(u)  dw 


m 


/'  gm<®2,,1“)  £(“>  d“  •  .L  VJ>  <>«> 

o  j“0 


min  J‘|g  (e2l,iw)  |  2  f(u>)  dtu, 
8m  o  ^ 


,1—  ,  2niwx  _-2iTiku)  ct  \  j,, 

(31)  J  g^e  )e  *(<*>)  du) 

o 


m 


=  l  a  (j)  p(j-k)  =  0,  k=l , 2 . m 

j-0  m 


Further 


(32)  H(f;Tm)  -  \  {log,  +  1}  =  H(fm) 

The  autoregressive  spectral  density  Tm((u)  can  be  derived 
axiomatically  using  a  maximum  entropy  principle. 


Theorem:  The  spectral  density  with  maximum  entropy  among 

all  spectral  densities  f(co)  satisfying  the  constraints 

(33)  I1  e2niwj  f(u>)  dw  =  p(j),  j-1,2 . m 

o 

for  m  specified  correlation  coefficients  ~ (1) , . . . , p(m)  is 

V“> 

Proof:  It  may  be  verified  that  T  (o>)  satisfies  the 

constraints  (33),  and  (22)  holds  for  any  f(a))  satisfying  (33). 
Since 

(34)  H(f )  <  H(f ;Tm)  =  H(Tm) , 

it  follows  that  fm  has  maximum  entropy  among  all  spectral 
densities  satisfying  the  constraints  (33). 

The  maximum  entropy  principle  provides  a  motivation  or 
justification  for  the  use  of  autoregressive  spectral  estimators. 
However  the  maximum  entropy  principle  provides  no  insight  into 
how  to  identify  an  optimal  order  m,  or  even  what  are  the  effects 
of  different  methods  of  estimating  the  parameters  o^. 
am(l)  ,  .  .  .  ,  am(m)  .  It  provides  no  guidance  for  how  to  learn  from 
the  data  whether  the  time  series  is  non-stationary  (long  memory) 
or  stationary  (short  memory) ,  or  whether  the  best  time  series 
model  is  AR,  MA,  or  ARMA.  It  is  a  principle  for  deriving 
probability  models,  rather  than  statistically  fitting  models  to 
data.  Further,  the  maximum  entropy  principle  justifies  auto¬ 
regressive  estimators  only  for  short  memory  time  series. 
Autoregressive  estimators  are  justified  for  long  memory  time 
series  by  the  fact  that  a  pure  harmonic  Y(t)  =  A  cos  2ir<*)t 
+  B  sin  2iru>t  satisfies  Y(t)-<J>Y(t-l)+Y(t-2)-0  where  $  -  2  cos  2ttw. 
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4.  Parametrizations  of  Autoregressive  Spectral  Estimators 

There  are  many  ways  for  forming  autoregressive  spectral 
estimators,  because  there  are  four  equivalent  ways  of  parametrizing 
them. 

A.  Autoregressive  coefficients.  Consider  coefficients 

a^Cl) , . . . ,am(m)  such  that  g(z)  =  1  +am(l)  z+. . .+am(m) zm  satisfies 

g(z)  i*  0  for  complex  z  such  that  |z|  <_  1.  We  call  g(z)  a 

minimum  phase  filter  transfer  function.  The  autoregressive 

coefficients  determine  a*  by 

m 

1  '  I'  lgm(e2,,1“)|-2d«, 
o 

One  computes  the  correlation  coefficients  p(l) , . . . , p(m)  by 

p(j)  *  J1  exp  (2TTiu)j)  o*  l g—  ( e2 71  laJ)  j  ”  dw. 
o 

B.  Correlations.  A  set  of  m  coefficients  p(l) , . . . , p(m) 
such  that  the  matrix  [with  p(0)  =  1,  p(-v)  =  p(v)] 


~p(0) 

p(-l)  •  •  • 

p(l-m) 

p(l) 

p(0)  .  .  . 

p(2-m) 

_ p(m-l) 

p(m-2)  .  .  . 

p(0) 

is  positive  definite  are  correlation  coefficients  of  a  time  series 
They  determine  autoregressive  coefficients  by  solving  the  Yule 
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Walker  equations  [with  am(0)  =  1] 

l  a  ( j )  p(j-k)  =  0,  k«l,...fm. 

j=0  m 

m 

One  computes  a *  by  =  £  am(j)  p(j). 

j=0 

C.  Partial  Correlations  Consider  coefficients 
rr(l)  ,  .  .  .  ,  7t(m)  satis  fying  ju(j  )  |  <  1,  j=l , 2 , . . . ,m.  They  represent 
partial  correlation  coefficients  defined  theoretically  by: 

Tt(j)  =  Corr  [ Y(t)  ,  Y(t-j>  jY(t-l)  ,  .  .  .  ,  Y(t-j+l)  ] 

Partial  correlation  coefficients  determine  autoregressive 
coefficients  and  residual  variances  by  a  recursive  algorithm 
called  the  Levinson-Durbin  recursion  [see  Levinson  (1947)  and 
Durbin  (I960)]:  for  k=l 

0^(1)  -  -tt(1),  oj  ■  1  -tt2(1), 

while  for  k=2,  3,...,m 

,  k-1 

<*k(k)  =  -7r(k)  =  - —  l  ak-1  (j)  p(k-j) 

ak-l  j==0 

ak  =  ak-l  U-ir'OO) 

ak(j)  =  ak_!(j)  -  *00  ak_1  (k-j) 
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Autoregressive  coefficients  determine  partial  correlation 
coefficients  by  the  recursion  [see  Barndorf f-Nielsen  and  Schon 
(1973)] . 

ak_i<j)  =  (1-tt2  (Ic)}” 1  (ak(j)  +  ir(k)  ak(k-j)> 

D.  Residual  variances.  Consider  coefficients 
, o| ..... satisfying 

1  >  a,2  >  ol> .  .  .  >a2  >  0 
L  Z  m 


and  m  coefficients  representing  sign  tt(1) . sign  n  (m)  .  The 

o's  represent  residual  variances;  they  determine  partial 
correlation  coefficients  by  a  formula  noted  by  Dickenson  (1978) 


7t(k)  =  sign  Tr(k) 
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5 .  Empirical  Autoregressive  Spectral  Estimation 

Given  a  sample  (Y(t),  t*=l,  2,  . .  .  ,T}  of  a  zero  mean  Gaussian 
stationary  time  series  whose  spectral  density  f(oj)  is  to  be 
estimated  by  an  autoregressive  spectral  density  estimator 

V“>  -  lsm(e2,,1“>r  . 

g^z)  =  1  +  am(l)  z+...+am(m)  zm  , 

we  define  the  order  identification  problem  to  be  the  choice  of 
order  m,  and  the  parameter  estimation  problem  to  be  the  choice 
of  algorithm  for  computing  the  coefficients  &m(l) , . . . , a  (m)  and 
the  residual  variance  3'. 

For  a  sample  Y(l) , . . . , Y(T)  of  a  zero  mean  Gaussian 
stationary  time  series,  an  approximation  for  the  joint 
probability  density  function  f 0 (Y(l) , . . . , Y(T) )  indexed  by  a 
parameter  6  is  obtained  as  follows.  We  assume  that  the  time 
series  Y(t)  has  been  divided  by  {R(0>}^  so  that  its  covariance 
function  equals  its  correlation  function.  Then 

-2  log  f q (Y(l)  ,  .  .  .  , Y (T) )  =  log  (2tt)T  det  K0+  Y*  K-1YT 

where  *  denotes  complex  conjugate  transpose,  Y*  =  (Y(l)  . Y(T)) 

K0  =  E  Y|  Y*  is  a  covariance  matrix  with  (s , t) -element  equal 
to  p0  (s-t) .  The  subscript  6  on  pQ(v)  and  f0(u>)  indicate  that 
they  are  functions  of  the  parameters  9,  which  are  to  be  estimated 
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The  covariance  matrix  l^is  a  Toepliz  matrix.  Asymptotically, 
as  T  tends  to  all  T  by  T  Toeplitz  matrices  have  the  same 
eigenvectors  exp  (-27titj/T),  j-0,1,...,  T-l.  The  corresponding 
eigenvalues  are  f0(j/T).  An  approximation  for  likelihood 
function  frequently  adopted  is  therefore 

“T  1o8  f0(Y(l),...,Y(T)) 

=  \  log  2n  +  |  f1  {log  f0<«)  +  du 

o  0 

=  \  log  2 it  +  H(f;f0) 

where  f(u>)  is  the  sample  spectral  density  defined  by 

T  T 

f(w)  -  l  | Y(t)  exp  (-2T7ituj)  | 2  i  l  Y2(t) 
t*l  t=l 

Maximum  likelihood  estimators  6  are  asymptotically  equivalent 

A 

to  the  estimators  0  minimizing  the  sample  cross-entropy 
H(f ; f Q) . 

If  the  parametric  model  f0(w)  is  assumed  to  be  the  spectral 

A  A  A 

density  of  an  AR(p) ,  then  estimators  a^,  ap(l) , . . . ,ap(p)  of  the 
coefficients  satisfy  Yule-Walker  equations  corresponding  to  the 
sample  correlation  function 
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p ( v)  ■=  J1  e27TitjV  f(u)  dw 

o 

T-v  T 

=  l  Y(t)  Y(t+v)  t  I  Y2(t) . 
t=l  t=l 

The  sample  correlation  function  p(v)  can  be  computed,  using  the 
Fast  Fourier  transform,  by 

-v  1  ^  k  ~  k 

p(v)  =  q  X  exp  ( 2iri  q  v)  f(^) 

which  holds  for  0  _<  v  <  Q  -  T. 

Ic  should  be  noted  that  we  are  assuming  the  time  series 
Y(t)  to  be  zero  mean,  or  more  generally  to  have  been  detrended 
by  subtraction  of  y(t),  an  estimator  of  the  mean  value 
function  y(t)  -  E[Y(t)].  When  y(t)  *  y,  a  constant,  we  take 

/v  __ 

y  =  Y.  When  y(t)  is  a  function  with  period  d  (as  might  be 
the  case  with  d=12  for  monthly  time  series)  one  might  take 

A 

for  y(t)  the  mean  of  Y(s)  for  all  s=t  modulo  d. 

By  recursively  solving  the  Yule-Walker  equations  one  can 
determine  sequences  of  (1)  estimated  residual  variances 


> 


> .  •  •  >  a 


m 


(2)  estimated  partial  correlations 


tt(1)  ,  w(2) ,  .  . . , tt (m) ,  . . . 
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(3)  estimated  autoregressive  coefficients 


a  (0)  =  1,  am(l) , . . . ,am(m) . 


(4)  autoregressive  spectral  density  estimators 


m 


*m(a,)  =  5m  LIq  am(j>  exP  2iri^ 


(5)  residual  spectral  densities 


V->  • 

m  f  (w) 


m 


3y  forming  a  smoothed  version  fm(to)  of  fm(u>)  one  can  obtain  a 

/s 

final  estimator  f(u>)  of  the  unknown  spectral  density: 


f(w)  =  fm(«) 


When  f(u)  is  tested  for  white  noise,  and  found  not  to  be 
significantly  different  from  white  noise,  then 


f (u)  «  fm(w) « 

and  the  autoregressive  spectral  density  estimator  is  the  final 
estimator. 

The  important  question  of  criteria  for  choosing  the  orders 
of  approximating  spectral  densities  is  discussed  in  the  next 


section. 
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Computing  estimators  of  autoregressive  coefficients  by 
solving  Yule-Walker  equations  is  called  stationary  autoregression 
because  the  autoregressive  coefficients  obtained  are  guaranteed 
to  correspond  to  a  stationary  time  series.  When  in  the 
foregoing  analysis  is  tending  to  approximate  0,  we  consider 
the  time  series  to  be  long  memory;  experimental  evidence 
indicates  that  more  reliable  estimators  of  the  spectral  density, 
and  also  of  the  autoregressive  coefficients,  are  provided  by 
least-squares  autoregression,  which  solves  the  normal  equations 

K  (0,0)  .  .  .  K(0,m) 

K  (1,0)  .  .  .  K(l,m) 

•  ••  ••• 

K(m, 0)  . . .  K(m,m) 

A 

for  a  suitable  estimator  K(i,j)  of 

K(i,j)  =  E[Y(t-i)  Y(t-j)] 

Possible  estimators  (for  i,  j=0,l,...,m)  are:  least  squares 
forward  algorithm 

T— in*  X 

K(i,j)  *=  A,  ~l-  Y(t+m-i)  Y (t+m-j) 
t=0 

or  least  squares  forward  and  backward  algorithm 

T”tD—  X 

K(i,j)  2 (17H)  Y (t+m-j )  +Y(t+i)  Y(t+j)} 


When  several  harmonics  are  present  in  the  data,  whose 
frequencies  are  close  together,  least  squares  autoregressive 
coefficient  estimators  are  more  effective  than  Yule-Walker 
autoregressive  coefficient  estimators  in  providing  autoregressive 
spectral  estimators  which  exhibit  the  split  peaks  one  would  like 
to  see  in  the  estimated  spectral  density. 

An  important  and  popular  algorithm  for  estimation  of  AR 
coefficients  was  introduced  by  Burg  in  1967  [see  Burg  (1967), 
(1968)].  For  references  to  descriptions  of  Burg's  algorithm, 
see  Kay  and  Marple  (1981) . 
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6 .  Autoregressive  Order  Determination 

The  problem  of  determining  the  orders  of  approximating 

autoregressive  schemes  is  an  important  example  of  the  problem 

of  estimating  a  function  by  using  the  smallest  finite  number  of 

parameters  which  provide  an  adequate  approximation  of  the 

function.  The  true  spectral  density  is  denoted  f(cj)  or 

fM(u))  .  An  approximation  "F(u)  is  defined  by  assuming  a  family 

of  densities  fQ  a  (uj)  which  are  functions  of  m  scalar 

91  *  *  * ,6m 

parameters  0^, . . . ,0  .  The  parameter  values  6^, . . . f 6m  which  minimize 

the  cross- entropy  H(f;f.  )  define  a  best  approximating 

°1‘ ‘ 


spectral  density  fm(oi)  = 


.  Q  —  (w)  .  An  estimator  of  f 

0i  »  • ■  •  .  0  m 

1  m 


is  f 


m 


(cx) )  =  f~  2  (w)  where  6,,...,  6  minimizes  H(f;f  ^ 

°1 ’ ' ' ‘ ’ m  °1 ’  4  4  4  ’ °m' 4 

A 

To  evaluate  the  properties  of  f m(w)  as  an  estimator  of  , 

one  must  distinguish  two  kinds  of  error.  The  model  approximation  or 
bias  error  is 


B(m)  -  Kf.;  fn). 

The  parameter  estimation  error  or  variance  is 

V(m,  T)  -  E  I 


As  m  tends  to  <*>,  B(m)  tends  to  0  and  V(m,  T)  tends  to  The 

A  ,  A 

optimal  value  m  minimizes  ElCf^;  fm)  as  measured  by 
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C(m)  -  B(m)  +  V(m, T) 


In  practice,  one  forms  an  estimator  C(m)  of  the  terms  in  C(m) 
which  depend  on  m. 

a 

One  calls  C(m)  a  criterion  function  for  order  determination. 
It  should  be  plotted,  and  interpreted  as  a  function,  not  just 
examined  for  its  minimum  value.  It  is  useful  to  define  a 

A 

best  value  of  m  (at  which  C(m)  is  minimized)  and  a  second 

A 

best  value  of  m  (at  which  C(m)  achieves  its  lowest  relative 
minimum) . 

A 

One  also  has  to  define  a  value  C(0)  of  the  criterion  function 
at  m=0.  If 


C (m)  >  C(0)  for  m=l,2,... 

then  the  optimum  order  is  0,  and  the  time  series  is  considered 
to  be  not  significantly  different  from  white  noise.  Further 
research  is  required  on  the  properties  of  order  determining 
criteria  as  tests  for  white  noise. 

Tests  for  white  noise  provide  an  alternative  approach  to 

A 

order  determination  since  an  autoregressive  estimator  fm(uO  is 
regarded  as  an  adequate  fit  (or  smoother)  if  the  residual 

—■  A 

spectral  density  f(w)  *  fm(w)  is  not  significantly  different 
from  the  sample  spectral  density  of  white  noise. 

A  widely  used  order  determining  criterion  is  that 
introduced  by  Akaike  [see  Akaike  (1974)].  It  should  be 
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emphasized  that  Akaike's  criterion  had  a  different  conceptual 
basis  than  the  one  outlined  above;  it  seeks  to  determine  the 
order  of  an  exact  autoregressive  scheme  which  the  time  series  is 
assumed  to  obey.  Then  one  can  raise  the  objection  against  it  that 
it  does  not  consistently  estimate  the  order,  which  is  done  by  a 
criterion  due  to  Hannan  and  Quinn  (1979).  Our  point  of  view  is 
that  the  approximating  autoregressive  scheme  need  only  have  the 

**  A 

property  that  f(w)  *  fm(u>)  is  just  barely  not  significantly 
different  from  the  sample  spectral  density  of  white  noise. 

Akaike's  order  determining  criterion  AIC  is  defined  by 

AIC  (m)  =  log  ^  ,  m  >  1 

Possible  definitions  for  AIC(O)  are  0  or  -1/T. 

The  Hannan  and  Quinn  criterion  is 

AICHQ(m)  =  log  ^  log  log  T  . 

Parzen  (1974),  (1977)  introduced  an  approximating 
autoregressive  order  criterion  called  CAT  (criterion  autoregressive 
transfer  function),  defined  by 

CAT (m)  =  £  (l4>  3“*  -  (1-Jji)  o'*  ,  m>l 

CAT (0)  =  -  (1  +  £). 

In  practice  CAT  and  AIC  lead  in  many  examples  to  exactly 
the  same  orders.  It  appears  reassuring  that  quite  different 
conceptual  foundations  can  lead  to  similar  conclusions  in 
practice. 


The  basic  aim  of  spectral  analysis  is  to  obtain  an  estimated 
spectral  density  which  does  not  introduce  spurious  spectral 
peaks,  and  resolves  close  spectral  peaks.  To  arrive  at  the 
final  form  of  spectral  estimator  in  an  applied  problem,  auto¬ 
regressive  spectral  estimators  can  be  used  to  identify  the 
memory  type  of  a  time  series  (long,  short,  or  no  memory)  and 
the  type  of  the  whitening  filter  of  a  short  memory  time  series 
(AR,  MA,  or  ARMA) .  An  empirical  time  series  spectral  analysis 
should  involve  the  following  stages. 

A.  Pre-processing .  To  analyze  a  time  series  sample 
Y (t) ,  t=l,...,T,  one  will  proceed  in  stages  which  often  involve 
the  subtraction  of  or  elimination  of  strong  effects  in  order  to 
see  more  clearly  weaker  patterns  in  the  time  series  structure. 

The  aim  of  pre-processing  is  to  transform  Y(*)  to  a  new  time 
series  Y(*)  which  is  short  memory.  Some  basic  pre-processing 
operations  are  memory  less  transformation  (such  as  square  root 


and  logarithm),  detrending,  "high  Dass"  filtering, 

1  T 

One  usually  subtracts  out  the  sample  mean  Y  =  ^  1 


and  differencing. 
Y (t) ;  then  the 


time  series  actually  processed  is  Y(t)  -  Y. 


Sample  Fourier  Transform  by  Data  Windowing,  Extending 


with  Zeroes,  and  Fast  Fourier  Transform.  Let  Y(t)  denote  a 


pre-processed  time  series.  The  first  step  in  the  analysis  could 


be  to  compute  successive  autoregressive  schemes  using  operations 
only  in  the  time  domain.  An  alternative  first  step  is  the 
computation  of  the  sample  Fourier  transform 


I 
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(1) 


T 

ij>(w)  =  l  Y(t)  exp  (-2niut) 
t=l 


w  ■=  q  ,  k  =  0,  1 ,  .  .  .  , Q- 1 , 


at  an  equi-spaced  grid  of  frequencies  in  0£w£l .  We  call  Q  the 
spectral  computation  number.  One  should  always  choose  Q>T,  and  we 
recommend  Q>_2T.  Prior  to  computing  <Kw)  ,  one  should  extend  the 
length  of  the  time  series  by  adding  zeroes  to  it.  Then  ^  (u> )  can 
be  computed  using  the  Fast  Fourier  transform. 

If  the  time  series  may  be  long  memory  one  should  compute 
in  addition  a  sample  "tapered"  or  "data  windowed"  Fourier 
transform 


(2)  ipw(w)  =  ^  Y(t)W(£)  exp  (-2iriu)t) 


C.  Sample  Spectral  Density.  The  sample  spectral  density 
f(w)  is  obtained  essentially  by  squaring  and  normalizing  the 
sample  Fourier  transform; 


(3) 


f(«*>) 


*  y 

<J  ki0 


U)= 


i(^)|2 


Q’ 


k-0,  1 . Q-l 


D.  Sample  Correlation  Function.  The  sample  correlation 

A 

function  p(v)  is  computed  (using  the  Fast  Fourier  Transform). 

E.  Autoregressive  analysis.  The  Yule-Walker  equations  are 

✓V 

solved  to  estimate  innovation  variances  o^,  to  which  are  applied 
order  determining  criteria  (AIC,  CAT)  to  determine  optimal 

A  A 

orders  m  and  also  to  test  for  white  noise.  The  value  of  o*.  and 


the  dynamic  range  of  the  autoregressive  spectral  estimator 

A 

f^(uj)  are  used  to  determine  the  memory  type  of  the  time  series. 

A 

Two  orders  (called  the  best  m  and  second  best  m  (2))  are 
determined  as  candidates  as  optimal  orders  corresponding  to  the 
absolute  minimum  and  lowest  relative  minimum  of  the  criterion 
function. 

F.  ARMA  analysis.  When  a  time  series  is  classified  as 

A 

short  memory,  an  approximating  AR  scheme  of  order  4m  is 
inverted  to  form  MA  (°°)  coefficients  which  are  used  to  estimate 
covariance  matrix  of  Y(t-j)  and  Yv(t-k).  A  subset  regression 
procedure  is  then  used  to  determine  a  "best  fitting"  ARMA 
scheme,  and  the  corresponding  ARMA  spectral  density  estimator. 

One  will  be  able  to  identify  moving  average  schemes  and  ARMA 
schemes  which  are  barely  invertible,  and  require  a  long  AR  scheme 
for  adequate  approximation.  The  long  AR  spectral  estimator 
introduces  spurious  spectral  peaks  when  compared  to  the  MA  or 
ARMA  estimator. 

G.  Non-stationary  autoregression.  When  a  time  series  is 
classified  as  long  memory,  more  accurate  estimators  of  auto¬ 
regressive  coefficients  are  provided  by  minimizing  a  least 
squares  criterion  or  by  Burg  estimators.  When  several  harmonics 
are  present  in  the  data,  whose  frequencies  are  close  together, 
least  squares  autoregressive  coefficient  estimators  are  more 
effective  than  Yule-Walker  autoregressive  coefficient  estimators 
in  providing  autoregressive  spectral  estimators  which  exhibit 
the  split  peaks  one  would  like  to  see  in  the  estimated  spectral 
density. 


H.  Long  Memory  analysis.  In  the  long  memory  case,  one  may 
want  to  represent  Y(t)  as  S(t)  +  N(t),  a  long  memory  signal  plus 
short  memory  noise.  An  approach  to  this  problem  may  be  provided 
by  treating  the  sample  spectral  density  values  f(k/Q)  as  a  data 
batch  to  be  studied  by  non-parametric  data  modeling  methods  using 
quantile  functions  [see  Parzen  (1979)].  The  details  of  such 
methods  are  under  development. 

I.  Nonparametric  kernel  spectral  density  estimator.  An 
estimator  f(w)  of  the  spectral  density  is  called:  parametric 
when  it  corresponds  to  a  parametric  model  for  the  time  series 
(such  as  an  AR  or  ARMA  model);  non-parametric  otherwise.  A 
general  form  of  non-parametric  estimator  is  the  kernel  estimator 

f(u>)  «  l  k(g)  p  (v )  e"2ltiuV  ,  0<w<l. 

V=  — oo 

The  problem  of  determining  optimum  truncations  points  M  has  no 
general  solution;  one  approach  is  to  choose  M  =  4m  to  obtain  a 
preliminary  smoothing  of  the  sample  spectral  density. 

J .  Inverse  correlations  and  cepstral  correlations. 

Estimators  of  pi(v)  and  y(v)  are  computed  and  used  to  form 
nonparametric  kernel  estimators  of  f  ^(u>)  and  ’  log  f(u>),  which 
may  provide  additional  insights  into  the  peaks  and  troughs  to 
be  given  significance  in  the  final  estimator  of  the  spectrum. 

Extensive  comparisons  of  different  methods  of  spectral 
estimation  are  given  in  Pagano  (1980),  Priestley  and  Beamish 
(1981),  Kay  and  Marple  (1981).  It  seems  clear  that  autoregressive 
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spectral  estimators  can  give  superior  results  when  properly  used. 
One  should:  determine  two  best  orders;  compute  autoregressive 
coefficients  by  Yule-Walker  equations  and  by  least  squares 
since  when  the  time  series  is  long  memory  autoregressive  spectral 
estimators  are  most  accurate  when  based  on  least  squares 
estimators  of  autoregressive  coefficients;  use  approximating 
autoregressive  schemes  to  determine  if  an  ARMA  scheme  fits 
better. 

The  end  of  the  story  of  the  search  for  the  perfect  spectral 
estimator  seems  attainable  if  one  does  not  think  of  spectral 
estimation  as  a  non-parametric  procedure  which  can  be  conducted 
independently  of  model  identification. 


i 
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8 .  A  Bibliography  of  Autoregressive  Spectral  Estimation 

The  biblioeraphv  aims  to  provide  a  comprehensive  list  of 
the  publications  in  English  which  are  directly  concerned  with 
developing  the  theory  and  methods  of  autoregressive  spectral 
esatimation. 

This  section  lists  some  of  the  publications  which 
contributed  to  the  development  of  AR  spectral  estimation. 

Yule  (1927)  introduces  autoregressive  schemes  to  model 

disturbed  periodicities  as  an  alternative  to  Schuster 
periodogram  analysis  and  its  spurious  periodicities; 
Yule-Walker  (1931)  equations  relate  autoregressive 
coefficients  and  correlations  of  a  stationary  time 
series . 

Wold  (1938)  introduces  infinite  order  autoregressive  and 
moving  average  representations  of  a  stationary  time 
series;  rigorous  conditions  are  given  by  Akutowicz  (1957). 

Mann  and  Wald  (1943)  derive  asymptotic  distribution  of 
estimators  of  autoregressive  coefficients . 

Levinson  (1947)  —  Durbin  (1960)  derive  recursive  methods 
of  solving  Yule-Walker  equations  which  subsequently 
lead  to  fast  algorithms  for  calculation  of  high  order 
AR  schemes . 

Whittle  (1954)  seems  to  be  the  first  to  use  autoregressive 

schemes  to  estimate  a  spectral  density.  He  used  a  low 
order  model  in  a  case  where  high  order  models  are 
indicated  by  order  determining  criterion  [Akaike 
(1974),  p.  720]. 


AO 

Grenander  and  Rosenblatt  (1956)  criticize  attempts  to  apply 
low  order  autoregressive  schemes,  and  develop  theory 
of  non-parametric  spectral  density  estimation,  as  do 
Bartlett,  Parzen,  and  Tukey  and  Blackman. 

Parzen  (1964) ,  Schaerf  (1964) ,  and  Parzen  (1968) 

discuss  autoregressive  spectral  estimation  as  a 
method  for  empirical  time  series  analysis;  no  theory 
is  given. 

Burg  (1967) ,  (1968)  publishes  his  pioneering  work  on  HEM 
(maximum  entropy  method  of  spectral  estimation)  and 
his  method  of  calculating  their  coefficients. 

Akaike  (1969),  (1970)  derives  asymptotic  variance  formulas 
for  autoregressive  spectral  estimators,  and  states 
FPE  (final  prediction  error)  criterion  for  order 
determination;  precursor  of  FPE  in  Davisson  (1965). 

Parzen  (1969)  derives  heuristically  a  formula  for  the 
asymptotic  variance  of  AR  spectral  estimators, 
confirmed  by  Kromer  (1969)  and  Berk  (1974) ;  an  order 
determining  criterion  is  proposed. 

Kromer  (1969)  in  an  unpublished  Ph.D.  thesis  presents  first 
rigorous  analysis  of  asymptotic  distribution  of 
autoregressive  spectral  estimators,  especially  their 
bias;  consistency  is  proved  only  in  an  iterated 
limit  mode  of  convergence  . 

Berk  (1974)  provides  first  proof  of  consistency  of 
autoregressive  spectral  estimators . 
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Carmichael  (1976)  in  an  unpublished  Ph.D.  thesis  provides 
alternative  proof  of  consistency  of  autoregressive 
estimators,  and  extends  technique  to  general  problems 
of  density  estimation. 

Akaike  (1973),  (1974),  (1977)  introduces  AIC  for  model 

order  criterion  and  relates  it  to  entropy  maximization 
principles . 

Parzen  (1974),  (1977)  introduces  CAT  for  AR  order 

determination  based  on  concept  of  finite  parameter  AR 
schemes  as  approximations  to  infinite  parameter  AR 
schemes . 

Hannan  and  Quinn  (1979)  derive  a  modification  of  AIC  which 
provides  consistent  estimators  of  the  AR  order,  when 
exact  model  is  assumed  to  be  a  finite  order  AR. 

Huzii  (1977),  Shibata  (1977),  and  Bhansali  (1980),  discuss 
rigorously  the  convergence  of  AR  spectral  estimators 
and  inverse  correlations. 

Childers  (1978)  and  Haykin  (1979)  contain  very  useful 
collections  of  papers. 

Pagano  (1980),  Beamish  and  Priestley  (1981),  and  Kay  and 
Marple  (1981)  provide  illuminating  reviews  of  AR 
spectral  estimators  and  comparisons  with  alternative 
methods . 
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